The goal of this exercice is to decipher relevant knowledges from Baskets DataSet and boost the business by prositionning products (items).
# get the structure of the dataset
str(basket)
## 'data.frame': 38765 obs. of 3 variables:
## $ Member_number : int 1808 2552 2300 1187 3037 4941 4501 3803 2762 4119 ...
## $ Date : Factor w/ 728 levels "01-01-2014","01-01-2015",..: 494 98 450 288 4 316 178 552 462 268 ...
## $ itemDescription: Factor w/ 167 levels "abrasive cleaner",..: 156 165 109 102 165 122 102 112 165 156 ...
The dataset is a dataframe with 3 columns: Member_number, Date, and Products description. The Member_number column must be considered as factor, like the id of member. The Date must be a date type. The itemDascription is already as.factor
library(magrittr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tibble)
basket$Member_number <- as.factor(basket$Member_number)
basket$Date <- as.Date(basket$Date, "%d-%m-%Y")
# In the case if we have the transaction date as 2010-12-01 08:26:00
# We can extract time from the column and store in another variable:
# TransTime<- format(basket$Date,"%H:%M:%S")
# This line create an other column with Date YMD.
# convert the dataframe as a tibble
basket <- tibble::as.tibble(basket)
# review the structure and summary
str(basket)
## Classes 'tbl_df', 'tbl' and 'data.frame': 38765 obs. of 3 variables:
## $ Member_number : Factor w/ 3898 levels "1000","1001",..: 787 1505 1264 181 1982 3839 3407 2732 1712 3037 ...
## $ Date : Date, format: "2015-07-21" "2015-01-05" ...
## $ itemDescription: Factor w/ 167 levels "abrasive cleaner",..: 156 165 109 102 165 122 102 112 165 156 ...
summary(basket)
## Member_number Date itemDescription
## 3180 : 36 Min. :2014-01-01 whole milk : 2502
## 2051 : 33 1st Qu.:2014-07-12 other vegetables: 1898
## 3050 : 33 Median :2015-01-21 rolls/buns : 1716
## 3737 : 33 Mean :2015-01-09 soda : 1514
## 2271 : 31 3rd Qu.:2015-07-10 yogurt : 1334
## 2433 : 31 Max. :2015-12-30 root vegetables : 1071
## (Other):38568 (Other) :28730
basket
Now is better formated. We can observe: + The Member_number 3180 has the biggest operations (36). + The Whole milk product is the most sold with 2502 operations. + The sampling data was done during two years, between 2014-01-01 and 2015-12-30.
Before to start the exploration, It is important to check if the dataset countain empty cells or non available data.
# Ckeck for Non Available (NA) cell
all(is.na(basket))
## [1] FALSE
# FALSE
# otherwise we can use #complete.cases(data) will return a logical vector indicating which rows have no missing values.
#Then use the vector to get only rows that are complete using basket[,].
basket <- basket[complete.cases(basket), ]
basket
# get a glimpse of the data
tibble::glimpse(basket)
## Observations: 38,765
## Variables: 3
## $ Member_number <fct> 1808, 2552, 2300, 1187, 3037, 4941, 4501, 3803...
## $ Date <date> 2015-07-21, 2015-01-05, 2015-09-19, 2015-12-1...
## $ itemDescription <fct> tropical fruit, whole milk, pip fruit, other v...
The dataset is composed by three variables that describe the money transaction of multiple costumers that bought mutiple products recurrently during a periode of time (date).
We need to determine how many of costumer.
We need to determine how many of products avaibale in this dataset.
We need to know if is there multiple products that bought together (forming a basket or couffin) for each costumer at a time.
This dataset gives us a data for all transactions that consists of Products bought in the store by several customers over a period of time. The Market Basket Analysis (MBA) uses Association Rule mining on the given transaction data. The goal is to use this data to boost the business. The idea this to change the layout of the physical store or rather an online store. For example, put particular product with bad sold rate near to a product with a good rate of sold to boost the business.
What we need to do is to group data by Member_number, and Date. We need this grouping and apply a function on it and store the output in another dataframe.
The following lines of code will combine all products from one Member_number and Date and combine all products from one transaction as one row, with each product, separated by ,.
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
transaction_list <- plyr::ddply(basket,c("Member_number","Date"),
function(df1)paste(df1$itemDescription,
collapse = ","))
colnames(transaction_list) <- c("Member_number","Date","Baskets")
transaction_list
Well, at all we have 14,963 transactions. Each transaction is composed by a set of products viewed in the colmun Baskets. Each a set of products is named a Basket or Couffin. In the following steps we need only the informations in Baskets column. We will save it in a file as csv format (comma separate Values).
# Back-Up the transaction List with Member number and Dates.
transaction_list_bkp <- transaction_list
#set column Member_number of dataframe transaction_list
transaction_list$Member_number <- NULL
#set column Date of dataframe transaction_list
transaction_list$Date <- NULL
#Rename column to Baskets
colnames(transaction_list) <- c("Baskets")
#Show Dataframe transactionData
transaction_list
arules algorithmelibrary(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
# Write the Baskets list to a file
write.csv(transaction_list,"transactions_list.csv", quote = FALSE, row.names = TRUE)
# load the Baskets list as a basket format using arules package
tr_list <- read.transactions('transactions_list.csv', format = 'basket', sep=',')
## Warning in asMethod(object): removing duplicated items in transactions
tr_list
## transactions in sparse format with
## 14964 transactions (rows) and
## 15131 items (columns)
summary(tr_list)
## transactions as itemMatrix in sparse format with
## 14964 rows (elements/itemsets/transactions) and
## 15131 columns (items) and a density of 0.0002339455
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2363 1827 1646 1453
## yogurt (Other)
## 1285 44396
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11
## 1 205 10012 2727 1273 338 179 113 96 19 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 3.00 3.54 4.00 11.00
##
## includes extended item information - examples:
## labels
## 1 1
## 2 10
## 3 100
14964 transactions and 15131 sold products (One product can be sold several times). Each transaction is a collection of products.0.0002339455 tells the percentage of non-zero cells in a sparse matrix. It is the total number of products that are purchased divided by a possible number of products in the matrix.round(14964 * 15131 * 0.0002339455)
## [1] 52970
number of products / number of transactions. For example, there are 205 transactions with only 2 products, and 2727 transactions with 4 products. We can get directly the The element length distribution by summary(tr_list)@lengths.summary(tr_list)@lengths
## sizes
## 1 2 3 4 5 6 7 8 9 10 11
## 1 205 10012 2727 1273 338 179 113 96 19 1
library(RColorBrewer)
#par(mfrow=c(2,1))
arules::itemFrequencyPlot(tr_list,topN=10,type="absolute",col=brewer.pal(8,'Pastel2'), main="Absolute Product Frequency Plot")
arules::itemFrequencyPlot(tr_list,topN=10,type="relative",col=brewer.pal(8,'Pastel2'), main="Relative Product Frequency Plot")
The option absolute plots numeric frequencies of each product independently. The relative option plots how many times these products have appeared as compared to others.
Theses plots shows the 10 most sold products. We can view more by changing the argument topN.
Whole milk and Other vegetables have the most sales.
To boost the business of citrus fruit, we need foe example to put it in the way of the whole milk or in the way of other vegetables.
library(ggplot2)
# How many transactions by product, the data will be arranged by descending
basket_plot_products <- basket %>% group_by(itemDescription) %>% dplyr::summarise(N_operations=n()) %>% arrange(desc(N_operations))
## plot the histogram of the best 50th sold products
ggplot2::ggplot(basket_plot_products[1:10,], aes(x = reorder(itemDescription, -N_operations), y = N_operations)) +
#aes_string(x = 'itemDescription', y = 'N_operations') ## without ordering
ggplot2::labs(title = "Overview of the first 10th best solds products during 2014-2015",
#fill = "",
x= "The products",
#colour= Date,
y = "The number of sales by product") +
ggplot2::theme(legend.title = element_text( colour="black",
size=11,
face="bold"),
title = element_text( size = 12,
face = 'bold'
),
text = element_text(size = 10,
face= 'bold'
),
axis.text.x=element_text(angle=45, hjust=1),
legend.position = "right",
legend.direction = "vertical"
) +
ggplot2::geom_bar(stat="identity", na.rm=TRUE)
The next step is to mine the rules using the APRIORI algorithm. The function apriori() is from arules package. We can set and optimize the paramter of the minimum support of 0.0001, the minimum confidence of 0.8, maximum of 10 products (maxlen).
#Min Support as 0.0001, confidence as 0.8 and maximum of 10 products.
association.rules <- arules::apriori(tr_list, parameter = list(supp=0.0001, conf=0.8)) #, maxlen=10
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 1e-04 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[15131 item(s), 14964 transaction(s)] done [0.02s].
## sorting and recoding items ... [165 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [647 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
Definition
Itemset: Collection of one or more items. K-item-set means a set of k items.
Support Count: Frequency of occurrence of an item-set
Support(s): Fraction of transactions that contain the item-set
# summary rules of min Support as 0.0001, confidence as 0.8 and maximum of 10 products.
summary(association.rules)
## set of 647 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5
## 135 438 74
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 3.906 4.000 5.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.0001337 Min. :0.8000 Min. : 5.066 Min. :2.000
## 1st Qu.:0.0001337 1st Qu.:1.0000 1st Qu.: 8.191 1st Qu.:2.000
## Median :0.0001337 Median :1.0000 Median : 11.645 Median :2.000
## Mean :0.0001373 Mean :0.9988 Mean : 19.286 Mean :2.054
## 3rd Qu.:0.0001337 3rd Qu.:1.0000 3rd Qu.: 20.387 3rd Qu.:2.000
## Max. :0.0003341 Max. :1.0000 Max. :364.976 Max. :5.000
##
## mining info:
## data ntransactions support confidence
## tr_list 14964 1e-04 0.8
The total number of rules is: 647
Distribution of rule length: A length of 4 items has the most rules: 438 and a length of 5 items have the lowest number of rules: 74.
inspect(association.rules[1:10])
## lhs rhs support
## [1] {domestic eggs,rubbing alcohol} => {frankfurter} 0.0001336541
## [2] {frankfurter,rubbing alcohol} => {domestic eggs} 0.0001336541
## [3] {bottled water,cookware} => {canned beer} 0.0001336541
## [4] {soap,tropical fruit} => {whole milk} 0.0001336541
## [5] {soap,whole milk} => {tropical fruit} 0.0001336541
## [6] {domestic eggs,skin care} => {other vegetables} 0.0001336541
## [7] {frankfurter,potato products} => {other vegetables} 0.0001336541
## [8] {ice cream,prosecco} => {other vegetables} 0.0001336541
## [9] {prosecco,waffles} => {sausage} 0.0001336541
## [10] {prosecco,waffles} => {other vegetables} 0.0001336541
## confidence lift count
## [1] 1 26.484956 2
## [2] 1 26.962162 2
## [3] 1 21.316239 2
## [4] 1 6.332628 2
## [5] 1 14.757396 2
## [6] 1 8.190476 2
## [7] 1 8.190476 2
## [8] 1 8.190476 2
## [9] 1 16.571429 2
## [10] 1 8.190476 2
Interpretation
100% of the customers who bought ‘domestic eggs,rubbing alcohol’ also bought ‘rankfurter’.
100% of the customers who bought ‘prosecco,waffles’ also bought ‘sausage’.
For example, we would like to know what costumer buy before buying canned beer.
beer.association.rules <- apriori(tr_list, parameter = list(supp=0.0001, conf=0.8),appearance = list(default="lhs",rhs="canned beer"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 1e-04 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[15131 item(s), 14964 transaction(s)] done [0.02s].
## sorting and recoding items ... [165 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [13 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# lhs= canned beer because you want to find out the probability of that in how many customers buy canned beer along with other items
inspect(head(beer.association.rules))
## lhs rhs support confidence lift count
## [1] {bottled water,
## cookware} => {canned beer} 0.0001336541 1 21.31624 2
## [2] {cat food,
## dishes} => {canned beer} 0.0001336541 1 21.31624 2
## [3] {ham,
## salty snack} => {canned beer} 0.0001336541 1 21.31624 2
## [4] {sausage,
## whole milk,
## zwieback} => {canned beer} 0.0001336541 1 21.31624 2
## [5] {brown bread,
## cake bar,
## sausage} => {canned beer} 0.0001336541 1 21.31624 2
## [6] {chicken,
## hygiene articles,
## whole milk} => {canned beer} 0.0001336541 1 21.31624 2
Interpretation
A straight-forward visualization of association rules is to use a scatter plot using plot() of the arulesViz package. It uses Support and Confidence on the axes.
library(arulesViz)
## Loading required package: grid
# Filter rules with confidence greater than 0.4 or 40%
subRules<-association.rules[quality(association.rules)$confidence>0.4]
#Plot SubRules
plot(subRules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
The above plot shows that rules with high lift have low support. You can use the following options:
plot(subRules,method="two-key plot")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
+ The two-key plot uses support and confidence on x and y-axis respectively.
#top10subRules <- head(subRules, n = 10, by = "confidence")
plot(subRules[1:20], method = "graph", engine = "htmlwidget", control = list(type = "items"))
## Warning: Unknown control parameters: type
## Available control parameters (with default values):
## itemCol = #CBD2FC
## nodeCol = c("#EE0000", "#EE0303", "#EE0606", "#EE0909", "#EE0C0C", "#EE0F0F", "#EE1212", "#EE1515", "#EE1818", "#EE1B1B", "#EE1E1E", "#EE2222", "#EE2525", "#EE2828", "#EE2B2B", "#EE2E2E", "#EE3131", "#EE3434", "#EE3737", "#EE3A3A", "#EE3D3D", "#EE4040", "#EE4444", "#EE4747", "#EE4A4A", "#EE4D4D", "#EE5050", "#EE5353", "#EE5656", "#EE5959", "#EE5C5C", "#EE5F5F", "#EE6262", "#EE6666", "#EE6969", "#EE6C6C", "#EE6F6F", "#EE7272", "#EE7575", "#EE7878", "#EE7B7B", "#EE7E7E", "#EE8181", "#EE8484", "#EE8888", "#EE8B8B", "#EE8E8E", "#EE9191", "#EE9494", "#EE9797", "#EE9999", "#EE9B9B", "#EE9D9D", "#EE9F9F", "#EEA0A0", "#EEA2A2", "#EEA4A4", "#EEA5A5", "#EEA7A7", "#EEA9A9", "#EEABAB", "#EEACAC", "#EEAEAE", "#EEB0B0", "#EEB1B1", "#EEB3B3", "#EEB5B5", "#EEB7B7", "#EEB8B8", "#EEBABA", "#EEBCBC", "#EEBDBD", "#EEBFBF", "#EEC1C1", "#EEC3C3", "#EEC4C4", "#EEC6C6", "#EEC8C8", "#EEC9C9", "#EECBCB", "#EECDCD", "#EECFCF", "#EED0D0", "#EED2D2", "#EED4D4", "#EED5D5", "#EED7D7", "#EED9D9", "#EEDBDB", "#EEDCDC", "#EEDEDE", "#EEE0E0", "#EEE1E1", "#EEE3E3", "#EEE5E5", "#EEE7E7", "#EEE8E8", "#EEEAEA", "#EEECEC", "#EEEEEE")
## precision = 3
## igraphLayout = layout_nicely
## interactive = TRUE
## engine = visNetwork
## max = 100
## selection_menu = TRUE
## degree_highlight = 1
## verbose = FALSE
The Arrows pointing from products to rule vertices indicate LHS products and an arrow from a rule to an product indicates the RHS.
Interprettaion + The graph shows taht all rules or itineraries focus or whome milk and Vegetables.
tea and frozens vegetables also buy cat food. We should place these in an aisle together.The Parallel Coordinates Plot is useful to visualized which products along with which items cause what kind of sales.
# Filter top 10 rules with highest lift
#subRules2<-head(subRules, n=16, by="lift")
plot(subRules[1:10], method="paracoord", control = list(reorder = TRUE))
Interpretation
The plot shows the itinerary used by the costumers during the shopping.
The top 10 rules show us that when costumers have whole milk and soap, It will be highly likely that they buy tropical fruit.
The whole milk is located at the first position. This means that there is no rule to buy it. Or It is well explosed for customers.
Each arrow corresponds to a rule.
The visualizing graphs and rules, it is possible to make decisions for the positioning of items in the supermarket.
Now, we would place Whole Milk and Vegetables beside each other; tea and frozens vegetables also buy cat food along side too.